@CorentynDevPro
Owner

No description provided.

Introduces APPLY_MIGRATIONS.md, a comprehensive runbook for safely applying schema migrations to production and staging environments. Includes checklists, commands, verification queries, troubleshooting steps, rollback procedures, and operational guidance for SREs and engineers.
Add runbook for applying database migrations
Introduces BACKFILL.md, a comprehensive runbook detailing procedures for safely backfilling historical hero_snapshots into normalized tables. Covers planning, execution, validation, monitoring, troubleshooting, rollback, and post-backfill tasks for SRE, engineering, and QA teams.
Add backfill runbook for historical snapshots
Introduces a detailed runbook for diagnosing and mitigating Postgres connection exhaustion incidents. Provides emergency steps, diagnostics, mitigation strategies, permanent remediation actions, and references for SREs and backend engineers.
Add Postgres connection exhaustion runbook
Introduces DB_RESTORE.md with detailed procedures for restoring the StarForge PostgreSQL database, including snapshot, logical dump, and point-in-time recovery workflows. Provides checklists, validation steps, troubleshooting, and communication templates for incident response.
Add database restore runbook documentation
Introduces a new runbook document for handling ETL failure spikes in the docs/OP_RUNBOOKS directory.
Introduces a comprehensive runbook for triaging, mitigating, and resolving sudden spikes in ETL worker failures for StarForge. Includes checklists, Prometheus queries, mitigation steps, common failure classes, communication guidelines, recovery procedures, and post-incident actions to support SREs and backend engineers during ETL incidents.
Introduces MIGRATION_ROLLBACK.md to document procedures for rolling back database migrations.
Introduces a comprehensive runbook for safely rolling back problematic database migrations. The guide covers triage, rollback strategies (down migration, restore from backup, app revert), verification steps, communication protocols, and post-incident actions to ensure data integrity and minimize downtime.
Introduced a new runbook in the documentation to outline procedures for handling secret compromise incidents.
Introduces a comprehensive runbook for handling suspected or confirmed secret and credential leaks. Covers immediate containment, rotation, forensics, investigation, recovery, communication, verification, and post-incident hardening steps for various secret types and providers.
Introduces a new runbook file for handling worker out-of-memory (OOM) issues in the documentation.
Introduces WORKER_OOM.md, a comprehensive runbook for triaging, mitigating, and recovering from Out-Of-Memory incidents affecting ETL and background workers. The document covers immediate containment, diagnostic steps, Kubernetes commands, code and workflow mitigations, resource tuning, and post-incident actions to improve reliability and prevent future OOM events.
Removed file extension from the runbook title for consistency with other documentation headers.
Update WORKER_OOM runbook title formatting
Introduces a new runbook file for QUEUE_BACKLOG in the OP_RUNBOOKS directory. This file will be used to document procedures and information related to queue backlog operations.
Updated the top-level titles in several documentation files to remove the file extension suffixes for consistency and improved readability.
Introduces a comprehensive runbook for triaging, mitigating, and resolving queue backlogs in StarForge. Covers detection, immediate actions, root cause analysis, safe scaling, dead-letter queue handling, recovery, and long-term prevention for Redis/BullMQ and DB-backed queues.
@CorentynDevPro CorentynDevPro self-assigned this Dec 4, 2025
@CorentynDevPro CorentynDevPro merged commit 834235e into main Dec 4, 2025
0 of 3 checks passed

2 participants